Notes from video instructions:
- The expectation is that we explore the data; the introduction lays out a much more robust background.
- Methods section 1 is where we introduce exploratory analysis: correlation analysis, scatterplots, collinearity.
- Methods section 2 is where we take our exploratory analysis and build some models, perhaps stopping at additive, perhaps going from additive to interactive; analyze RMSE, adjusted R-squared, and residual standard error. No matter what, we then look for ways to improve our additive model by intelligently shrinking or growing it.
- Choose a couple of models for a table: model, AIC/BIC, number of variables, RMSE, other statistics.
- The goal of the project is to explore the data, not to make the best model.
- The best way to fail is to make a path and just stick to it. Don't put the blinders on.
The data we have chosen to look at is housing prices in California. This data comes from Kaggle and outlines factors that would go into predicting the price of a house in California. As people who currently rent (and one of us living in California), we hope to one day be able to purchase a home, and understanding this model could help us determine important factors in predicting the price and whether houses we intend to buy are a good deal or not.
In this document we will modify and assess the data, then use our assessments to build a good model which neither overfits nor underfits the data. We hope to create a general model that, based on some factors about a given property, can give an expected price as an output.
Original variables:
- Median_House_Value: Median house value for households within a block (measured in US dollars) [$]
- Median_Income: Median income for households within a block of houses (measured in tens of thousands of US dollars) [$10k]
- Median_Age: Median age of a house within a block; a lower number is a newer building [years]
- Tot_Rooms: Total number of rooms within a block
- Tot_Bedrooms: Total number of bedrooms within a block
- Population: Total number of people residing within a block
- Households: Total number of households (a group of people residing within a home unit) for a block
- Latitude: A measure of how far north a house is; a higher value is farther north [°]
- Longitude: A measure of how far west a house is; a higher value is farther west [°]
- Distance_to_coast: Distance to the nearest coast point [m]
- Distance_to_LA: Distance to the center of Los Angeles [m]
- Distance_to_SanDiego: Distance to the center of San Diego [m]
- Distance_to_SanJose: Distance to the center of San Jose [m]
- Distance_to_SanFrancisco: Distance to the center of San Francisco [m]
Variables added by the group later in the project:
- dist_to_nearest_city: The minimum of the four distance-to-city variables above, divided by 1000 to convert to km [km]
- nearest_city: Categorical variable indicating which of the four cities listed above is the closest [city name]
- near_a_city_100: A factor variable indicating whether a house is within 100 km of the nearest city [TRUE/FALSE]
- near_a_city_200: A factor variable indicating whether a house is within 200 km of the nearest city [TRUE/FALSE]
First we need to load in the data and prepare some of the columns.
library(readr)
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
library(knitr)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(faraway)
housing_data = read.csv("California_Houses.csv")
To augment the data a bit, we take the predictors Distance_to_LA, Distance_to_SanDiego, Distance_to_SanJose, and Distance_to_SanFrancisco and convert them into two separate columns:
- a factor variable nearest_city: this segments the data into regions of California and lets us check whether there is any relevance in being closer to one city vs. another.
- a numeric variable dist_to_nearest_city: this gives us the distance to that nearest city in km.
nearest_city = rep("", nrow(housing_data))
dist_to_nearest = rep(0, nrow(housing_data))
nearest_city_options = c("LA", "San Diego", "San Jose", "San Fransisco")
for (i in 1:nrow(housing_data)) {
subset = housing_data[i,
c("Distance_to_LA",
"Distance_to_SanDiego",
"Distance_to_SanJose",
"Distance_to_SanFrancisco")]
nearest_city[i] = nearest_city_options[which.min(subset)]
dist_to_nearest[i] = min(subset) / 1000 # convert to KM
}
housing_data$nearest_city = as.factor(nearest_city)
housing_data$dist_to_nearest_city = dist_to_nearest
We will then perform a quick assessment of the variables we just created. First, we will inspect what percentage of the properties are closest to each city.
data.frame(
Los_Angeles = mean(housing_data$nearest_city == "LA"),
San_Diego = mean(housing_data$nearest_city == "San Diego"),
San_Jose = mean(housing_data$nearest_city == "San Jose"),
San_Fransisco = mean(housing_data$nearest_city == "San Fransisco"))
## Los_Angeles San_Diego San_Jose San_Fransisco
## 1 0.4759 0.09685 0.1824 0.2449
And overall numbers for nearest city:
summary(housing_data$nearest_city)
## LA San Diego San Fransisco San Jose
## 9823 1999 5054 3764
We will also run a quick sanity check that ensures that based on latitude and longitude we do indeed have the correct nearest city.
plot(Latitude ~ Longitude, housing_data,
col = nearest_city,
pch = as.numeric(nearest_city))
legend("topright",
legend = levels(housing_data$nearest_city),
col = c(1:4),
pch = c(1:4))
Next we will gather some data on how far data points are from the nearest city.
summary(housing_data$dist_to_nearest_city)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4 17.2 36.1 69.5 93.1 489.6
And we will plot the distribution of distances as both a box plot and a histogram to help frame these distances.
par(mfrow = c(1,2))
boxplot(housing_data$dist_to_nearest_city,
ylab = "Distance [km]",
main = "Boxplot of Distances\n to Nearest City")
hist(housing_data$dist_to_nearest_city,
xlab = "Distance [km]",
main = "Histogram of Distances\n to Nearest City")
Between the summary information and the box plot, we can see that more than 3/4 of the properties in the data set are within 100 km of the nearest city. We will use this to create a new variable called near_a_city_100, a factor variable which evaluates to TRUE if the property is within 100 km of a city and FALSE otherwise.
housing_data$near_a_city_100 = as.factor(housing_data$dist_to_nearest_city < 100)
We also see that the whisker on the box plot above ends around the 200 km mark, so we will investigate what proportion of the data falls within 200 km.
mean(housing_data$dist_to_nearest_city < 200)
## [1] 0.9161
Given that more than 91% of the data falls within 200 km of a city, we will also create a factor variable for this threshold.
housing_data$near_a_city_200 = as.factor(housing_data$dist_to_nearest_city < 200)
Our hope here is that one of the two factor variables created will be a sufficient demarcation line to where certain variables start to have differing effects on the response when we build a model. We will explore that further later.
mean(as.numeric(housing_data$near_a_city_100) - 1)
## [1] 0.7664
76.64% of data points are within 100 km of the center of the nearest city.
mean(as.numeric(housing_data$near_a_city_200) - 1)
## [1] 0.9161
91.61% of data points are within 200 km of the center of the nearest city.
Having extracted what we need from the distance-to-city variables, we will now remove them from the data set in order to make plotting and analysis more manageable.
housing_data = subset(housing_data,
select = -c(Distance_to_LA,
Distance_to_SanFrancisco,
Distance_to_SanDiego,
Distance_to_SanJose))
data.frame(name = names(housing_data))
## name
## 1 Median_House_Value
## 2 Median_Income
## 3 Median_Age
## 4 Tot_Rooms
## 5 Tot_Bedrooms
## 6 Population
## 7 Households
## 8 Latitude
## 9 Longitude
## 10 Distance_to_coast
## 11 nearest_city
## 12 dist_to_nearest_city
## 13 near_a_city_100
## 14 near_a_city_200
With the data loaded and prepped, we want to start building the model. Before we do that, we want to check the pairs of the different predictor variables to see which predictors have strong correlations. We will leave out Latitude and Longitude, and represent nearest_city as a color.
plot(housing_data[ , c(2:7, 10,12)],
col = as.numeric(housing_data$nearest_city),
main = "Plot of Every Variable vs Every Other \nVariable in Housing Data (some withheld)")
A couple of obvious collinearities jump out. To see them numerically, we will use the cor function from base R.
round(cor(housing_data[ , c(2:10,12)]), 2)
## Median_Income Median_Age Tot_Rooms Tot_Bedrooms Population
## Median_Income 1.00 -0.12 0.20 -0.01 0.00
## Median_Age -0.12 1.00 -0.36 -0.32 -0.30
## Tot_Rooms 0.20 -0.36 1.00 0.93 0.86
## Tot_Bedrooms -0.01 -0.32 0.93 1.00 0.88
## Population 0.00 -0.30 0.86 0.88 1.00
## Households 0.01 -0.30 0.92 0.98 0.91
## Latitude -0.08 0.01 -0.04 -0.07 -0.11
## Longitude -0.02 -0.11 0.04 0.07 0.10
## Distance_to_coast -0.24 -0.23 0.00 -0.02 -0.04
## dist_to_nearest_city -0.23 -0.28 0.00 -0.03 -0.07
## Households Latitude Longitude Distance_to_coast
## Median_Income 0.01 -0.08 -0.02 -0.24
## Median_Age -0.30 0.01 -0.11 -0.23
## Tot_Rooms 0.92 -0.04 0.04 0.00
## Tot_Bedrooms 0.98 -0.07 0.07 -0.02
## Population 0.91 -0.11 0.10 -0.04
## Households 1.00 -0.07 0.06 -0.06
## Latitude -0.07 1.00 -0.92 0.30
## Longitude 0.06 -0.92 1.00 0.01
## Distance_to_coast -0.06 0.30 0.01 1.00
## dist_to_nearest_city -0.07 0.46 -0.21 0.75
## dist_to_nearest_city
## Median_Income -0.23
## Median_Age -0.28
## Tot_Rooms 0.00
## Tot_Bedrooms -0.03
## Population -0.07
## Households -0.07
## Latitude 0.46
## Longitude -0.21
## Distance_to_coast 0.75
## dist_to_nearest_city 1.00
Below are some of the variable pairs with strong correlation:
- Tot_Rooms - Tot_Bedrooms
- Tot_Rooms - Population
- Tot_Rooms - Households
- Tot_Bedrooms - Population
- Tot_Bedrooms - Households
- Population - Households
It appears that the variables that have to do with density show strong positive correlation. This makes sense: as the total number of people within a block (Population) increases, you also see an increase in total households within a block (Households), which leads to an increase in both total bedrooms (Tot_Bedrooms) and total rooms (Tot_Rooms). We are not attempting to demonstrate causation, simply to show how density indicators are linked to each other.
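These strongly correlated pairs can also be pulled out programmatically rather than read off the matrix by eye. A minimal sketch (the 0.85 cutoff here is our arbitrary illustrative threshold, not a value from the original analysis):

```r
# List predictor pairs whose absolute correlation exceeds a chosen cutoff
# (0.85 is an arbitrary illustrative threshold)
cors = cor(housing_data[ , c(2:10, 12)])
high = which(abs(cors) > 0.85 & upper.tri(cors), arr.ind = TRUE)
data.frame(var1 = rownames(cors)[high[, 1]],
           var2 = colnames(cors)[high[, 2]],
           correlation = round(cors[high], 2))
```

Note that this would also flag Latitude - Longitude, whose correlation is strongly negative.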
These variables may still have useful interactions. For instance, an area where population density is low but the number of rooms is high, or where the number of households is low but the number of rooms is high, may indicate an increase in house value. We will keep this in mind for later.
We will also explore variable correlations within the context of whether properties are close to or far away from a city, to see whether any patterns emerge within either group that were otherwise hidden.
str(housing_data)
## 'data.frame': 20640 obs. of 14 variables:
## $ Median_House_Value : num 452600 358500 352100 341300 342200 ...
## $ Median_Income : num 8.33 8.3 7.26 5.64 3.85 ...
## $ Median_Age : int 41 21 52 52 52 52 52 52 42 52 ...
## $ Tot_Rooms : int 880 7099 1467 1274 1627 919 2535 3104 2555 3549 ...
## $ Tot_Bedrooms : int 129 1106 190 235 280 213 489 687 665 707 ...
## $ Population : int 322 2401 496 558 565 413 1094 1157 1206 1551 ...
## $ Households : int 126 1138 177 219 259 193 514 647 595 714 ...
## $ Latitude : num 37.9 37.9 37.9 37.9 37.9 ...
## $ Longitude : num -122 -122 -122 -122 -122 ...
## $ Distance_to_coast : num 9263 10226 8259 7768 7768 ...
## $ nearest_city : Factor w/ 4 levels "LA","San Diego",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ dist_to_nearest_city: num 21.3 20.9 18.8 18 18 ...
## $ near_a_city_100 : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
## $ near_a_city_200 : Factor w/ 2 levels "FALSE","TRUE": 2 2 2 2 2 2 2 2 2 2 ...
city = housing_data$near_a_city_100 == TRUE
non_city = ! city
plot(housing_data[ city, c(2:7, 10,12)],
main = "City Based Plots",
col = "darkgray")
plot(housing_data[ non_city, c(2:7, 10,12)],
main = "Non-City Based Plots",
col = "darkblue")
No further trends obviously emerge from breaking the data out into city vs. non-city.
Before moving on, we will check whether any of the cities show collinearities specific to their locality. To do this we will repeat the previous step once for each city, using only that city's data, in hopes of isolating information specific to the major cities that cover most of the data.
levels(housing_data$nearest_city)
## [1] "LA" "San Diego" "San Fransisco" "San Jose"
la = housing_data$nearest_city == "LA" & city
sd = housing_data$nearest_city == "San Diego" & city
sf = housing_data$nearest_city == "San Fransisco" & city
sj = housing_data$nearest_city == "San Jose" & city
plot(housing_data[la, c(2:7, 10,12)],
main = "Los Angeles Based Plots",
col = 1)
plot(housing_data[sd, c(2:7, 10,12)],
main = "San Diego Based Plots",
col = 2)
plot(housing_data[sf, c(2:7, 10,12)],
main = "San Fransisco Based Plots",
col = 3)
plot(housing_data[sj, c(2:7, 10,12)],
main = "San Jose Based Plots",
col = 4)
The last graph we will create for assistance is a graph of
Median_House_Value vs all other predictors.
par(mfrow = c(3,3))
plot(Median_House_Value ~ Median_Income, housing_data, col = 1)
plot(Median_House_Value ~ Median_Age, housing_data, col = 2)
plot(Median_House_Value ~ Tot_Rooms, housing_data, col = 3)
plot(Median_House_Value ~ Tot_Bedrooms, housing_data, col = 4)
plot(Median_House_Value ~ Population, housing_data, col = 5)
plot(Median_House_Value ~ Households, housing_data, col = 6)
plot(Median_House_Value ~ Latitude, housing_data, col = 7)
plot(Median_House_Value ~ Distance_to_coast, housing_data, col = 8)
plot(Median_House_Value ~ dist_to_nearest_city, housing_data, col = 9)
To start things off, we created a few helper functions to help us build and evaluate the models.
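The helper definitions themselves do not appear in this document, so the sketch below is our hypothetical reconstruction of what `compare_models` (and the plotting helpers used later) might look like, matching the columns of the comparison tables reported below; names and details are assumptions.

```r
# Hypothetical reconstruction of the helpers used in this report;
# the exact originals were not shown in the document.
library(faraway)  # vif()

# Shortcut formula for leave-one-out cross-validated RMSE of a linear model
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# Build one row of metrics per model, matching the table columns below
compare_models = function(models, model_names, trn_data, tst_data) {
  tab = t(sapply(models, function(m) c(
    "Number of Parameters" = length(coef(m)),
    "Train RMSE" = rmse(trn_data$Median_House_Value, predict(m, trn_data)),
    "Test RMSE"  = rmse(tst_data$Median_House_Value, predict(m, tst_data)),
    "VIF (mean)" = mean(vif(m)),
    "LOOCV"      = calc_loocv_rmse(m),
    "Adj. R-Squared" = summary(m)$adj.r.squared)))
  rownames(tab) = model_names
  tab
}

# Diagnostic plots used in the model assessment section
plot_fitted_resid = function(model) {
  plot(fitted(model), resid(model), xlab = "Fitted", ylab = "Residuals",
       col = "dodgerblue", pch = 20)
  abline(h = 0, col = "darkorange", lwd = 2)
}

plot_qq = function(model) {
  qqnorm(resid(model), col = "dodgerblue", pch = 20)
  qqline(resid(model), col = "darkorange", lwd = 2)
}
```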
Now we want to get into actually creating the models and choosing the best variables for prediction. Before we create any models, however, we will make a train / test split to use for evaluation.
set.seed(420)
housing_data_idx = sample(nrow(housing_data), size = trunc(0.80 * nrow(housing_data)))
housing_data_trn = housing_data[housing_data_idx, ]
housing_data_tst = housing_data[-housing_data_idx, ]
What we will do next is create our models. We will refine and modify these models along the way and attempt to keep those which we deem significant.
First we will create a full additive model with all predictors. This is a somewhat natural choice, as a model which uses all available predictors is a simple and usable starting point. We will also check whether any of its coefficients have a large p-value.
full_add_model = lm(Median_House_Value ~ ., data = housing_data_trn)
sum(coef(summary(full_add_model))[, 4] > 0.05)
## [1] 0
Next we will step through the model to see whether any parameters from the full model above are not actually useful. We will run this for both AIC and BIC.
n = length(resid(full_add_model))
full_add_model_aic = step(full_add_model, direction = "backward", trace = 0)
full_add_model_bic = step(full_add_model, direction = "backward", trace = 0, k = log(n))
c(AIC = length(coef(full_add_model_aic)),
BIC = length(coef(full_add_model_bic)))
## AIC BIC
## 16 15
We see that the backwards AIC search does not actually drop any variables from the model. BIC, on the other hand, dropped the Households predictor. We will use the BIC variant for comparison later.
However, we suspect this model may still be too cumbersome, and we would like to explore how we might decrease the size of the model intelligently. Instead we will hand pick some variables to drop based on our analysis in Section 2.1.3. We will remove Tot_Rooms, Tot_Bedrooms, and Population, since they have > 0.9 correlation with Households, in an effort to pare these 4 predictors down to a single predictor. Next we will remove Latitude and Longitude, since we have tried to capture geographic information through the nearest-city variables. We will also drop near_a_city_200 from the model, since it is a binary that only differentiates ~9% of the data, and keep near_a_city_100 to attempt to capture that distinction as well. Finally, we will remove Median_Age, because it appeared to have no relationship with Median_House_Value in our graph of Median_House_Value vs. all other predictors.
hand_picked_add_model = lm(Median_House_Value ~ . - Tot_Rooms - Tot_Bedrooms - Population - Latitude - Longitude - near_a_city_200 - Median_Age, data = housing_data_trn)
Then we will compare these models to see how they perform.
mnames = c("Full", "BIC", "Hand Picked")
models_to_compare = vector(mode="list", length=3)
models_to_compare[[1]] = full_add_model
models_to_compare[[2]] = full_add_model_bic
models_to_compare[[3]] = hand_picked_add_model
comparison_table = compare_models(models_to_compare, mnames, housing_data_trn, housing_data_tst)
kable(comparison_table)
| Number of Parameters | Train RMSE | Test RMSE | VIF (mean) | LOOCV | Adj. R-Squared | |
|---|---|---|---|---|---|---|
| Full | 16 | 68474 | 68405 | 14.527 | 68684 | 0.6478 |
| BIC | 15 | 68491 | 68547 | 10.603 | 68667 | 0.6477 |
| Hand Picked | 9 | 74244 | 74931 | 2.067 | 74284 | 0.5862 |
We see essentially no change in performance with the BIC model, along with an improvement in the mean Variance Inflation Factor over the full model. The hand-picked model also performs decently well across most of the metrics, with a substantial decrease in VIF and a nice reduction in the number of predictors. It is a bit of a toss-up at this point as to which model is best; reducing the number of predictors by roughly 40% is probably worth a performance decrease of about 10% in RMSE.
However the model can still be improved.
dist_to_nearest_city and Distance_to_coast
both appear to have somewhat logarithmic relationships with
Median_House_Value. So we will transform these two
predictors in order to see how the model performs with this
transform.
hand_picked_add_model_transform = lm(Median_House_Value ~ . - Tot_Rooms - Tot_Bedrooms - Population - Latitude - Longitude - Distance_to_coast + log(Distance_to_coast) - dist_to_nearest_city + log(dist_to_nearest_city) - near_a_city_200 - Median_Age, data = housing_data_trn)
mnames = c("BIC", "Hand Picked", "Hand Picked with Transform")
models_to_compare = vector(mode="list", length=3)
models_to_compare[[1]] = full_add_model_bic
models_to_compare[[2]] = hand_picked_add_model
models_to_compare[[3]] = hand_picked_add_model_transform
comparison_table = compare_models(models_to_compare, mnames, housing_data_trn, housing_data_tst)
kable(comparison_table)
| Number of Parameters | Train RMSE | Test RMSE | VIF (mean) | LOOCV | Adj. R-Squared | |
|---|---|---|---|---|---|---|
| BIC | 15 | 68491 | 68547 | 10.603 | 68667 | 0.6477 |
| Hand Picked | 9 | 74244 | 74931 | 2.067 | 74284 | 0.5862 |
| Hand Picked with Transform | 9 | 69214 | 69610 | 1.567 | 69256 | 0.6403 |
This new hand-picked model with transforms appears to be a very strong performer, outperforming the plain hand-picked model in VIF and comparing just about on par with the BIC model.
This new hand-picked model is good, but it is worth checking at this point whether we fail to reject the null hypothesis that any of the coefficients are zero.
which(coef(summary(hand_picked_add_model_transform))[, "Pr(>|t|)"] > 0.05)
## nearest_citySan Jose
## 6
San Jose being the closest city appears to be insignificant. At this point we could go back, remove the Distance_to_SanJose column, and track only the distances to San Francisco, San Diego, and Los Angeles. However, for the sake of keeping the narrative linear, we will not do that, and will instead press on with this model. We will also check, via an ANOVA test, how the current hand-picked model performs if we cease tracking nearest cities entirely, and compare the two models.
no_city_add_model_transform = lm(Median_House_Value ~ . - Tot_Rooms - Tot_Bedrooms - Population - Latitude - Longitude - Distance_to_coast - nearest_city + log(Distance_to_coast) - dist_to_nearest_city + log(dist_to_nearest_city) - near_a_city_200 - Median_Age, data = housing_data_trn)
anova(no_city_add_model_transform, hand_picked_add_model_transform)
## Analysis of Variance Table
##
## Model 1: Median_House_Value ~ (Median_Income + Median_Age + Tot_Rooms +
## Tot_Bedrooms + Population + Households + Latitude + Longitude +
## Distance_to_coast + nearest_city + dist_to_nearest_city +
## near_a_city_100 + near_a_city_200) - Tot_Rooms - Tot_Bedrooms -
## Population - Latitude - Longitude - Distance_to_coast - nearest_city +
## log(Distance_to_coast) - dist_to_nearest_city + log(dist_to_nearest_city) -
## near_a_city_200 - Median_Age
## Model 2: Median_House_Value ~ (Median_Income + Median_Age + Tot_Rooms +
## Tot_Bedrooms + Population + Households + Latitude + Longitude +
## Distance_to_coast + nearest_city + dist_to_nearest_city +
## near_a_city_100 + near_a_city_200) - Tot_Rooms - Tot_Bedrooms -
## Population - Latitude - Longitude - Distance_to_coast + log(Distance_to_coast) -
## dist_to_nearest_city + log(dist_to_nearest_city) - near_a_city_200 -
## Median_Age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 16506 8.21e+13
## 2 16503 7.91e+13 3 2.97e+12 206 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
mnames = c("Cities", "No Cities")
models_to_compare = vector(mode="list", length=2)
models_to_compare[[1]] = hand_picked_add_model_transform
models_to_compare[[2]] = no_city_add_model_transform
comparison_table = compare_models(models_to_compare, mnames, housing_data_trn, housing_data_tst)
kable(comparison_table)
| Number of Parameters | Train RMSE | Test RMSE | VIF (mean) | LOOCV | Adj. R-Squared | |
|---|---|---|---|---|---|---|
| Cities | 9 | 69214 | 69610 | 1.567 | 69256 | 0.6403 |
| No Cities | 6 | 70499 | 70643 | 1.672 | 70529 | 0.6269 |
We see that dropping the cities from the model does not pass the ANOVA test: we reject the null hypothesis that the smaller model is sufficient. Dropping them would reduce the number of predictors from 9 to 6, which is substantial and worthwhile, but we would also lose an important and lucid categorical predictor; it is simple and straightforward to track "of these 4 major cities in California, which is the nearest?" Therefore we will opt to keep nearest_city in the model for now.
However, the work is not done yet. It is still worth inspecting whether noticeable performance gains can be made by growing this model and allowing for interactions between the chosen variables. So, for the sake of comparison, we will allow all possible two-way interactions in the hand-picked model with transformations.
hand_picked_int_model = lm(Median_House_Value ~ (. - Tot_Rooms - Tot_Bedrooms - Population - Latitude - Longitude - Distance_to_coast + log(Distance_to_coast)- dist_to_nearest_city + log(dist_to_nearest_city) - near_a_city_200 - Median_Age) ^ 2, data = housing_data_trn)
We will also allow a backwards BIC search to attempt to pare this model down for us.
n = length(resid(hand_picked_int_model))
hand_picked_int_model_aic = step(hand_picked_int_model, direction = "backward", trace = 0, k = log(n))
And finally we will compare these models.
mnames = c("Hand-Picked w/ Trans.", "Fully Interactive", "Step")
models_to_compare = vector(mode="list", length=3)
models_to_compare[[1]] = hand_picked_add_model_transform
models_to_compare[[2]] = hand_picked_int_model
models_to_compare[[3]] = hand_picked_int_model_aic
comparison_table = compare_models(models_to_compare, mnames, housing_data_trn, housing_data_tst)
kable(comparison_table)
| Number of Parameters | Train RMSE | Test RMSE | VIF (mean) | LOOCV | Adj. R-Squared | |
|---|---|---|---|---|---|---|
| Hand-Picked w/ Trans. | 9 | 69214 | 69610 | 1.567 | 69256 | 0.6403 |
| Fully Interactive | 34 | 65818 | 66387 | 164.805 | 65993 | 0.6743 |
| Step | 25 | 65891 | 66434 | 141.227 | 66012 | 0.6737 |
What can be seen here is that, with a serious increase in the number of predictors (a jump from 9 to 25 for the stepped model), we see only marginal improvements in adjusted \(R^2\), RMSE, and LOOCV, at the cost of a massive increase in mean VIF.
At this point we will compare most of the models we developed in a single table for comparison and final selection.
mnames = c("Additive", "Additive / BIC", "Hand Picked", "Hand Picked with Transform", "Cities Removed", "Fully Interactive", "Interactive BIC")
models_to_compare = vector(mode="list", length=7)
models_to_compare[[1]] = full_add_model
models_to_compare[[2]] = full_add_model_bic
models_to_compare[[3]] = hand_picked_add_model
models_to_compare[[4]] = hand_picked_add_model_transform
models_to_compare[[5]] = no_city_add_model_transform
models_to_compare[[6]] = hand_picked_int_model
models_to_compare[[7]] = hand_picked_int_model_aic
comparison_table = compare_models(models_to_compare, mnames, housing_data_trn, housing_data_tst)
kable(comparison_table)
| Number of Parameters | Train RMSE | Test RMSE | VIF (mean) | LOOCV | Adj. R-Squared | |
|---|---|---|---|---|---|---|
| Additive | 16 | 68474 | 68405 | 14.527 | 68684 | 0.6478 |
| Additive / BIC | 15 | 68491 | 68547 | 10.603 | 68667 | 0.6477 |
| Hand Picked | 9 | 74244 | 74931 | 2.067 | 74284 | 0.5862 |
| Hand Picked with Transform | 9 | 69214 | 69610 | 1.567 | 69256 | 0.6403 |
| Cities Removed | 6 | 70499 | 70643 | 1.672 | 70529 | 0.6269 |
| Fully Interactive | 34 | 65818 | 66387 | 164.805 | 65993 | 0.6743 |
| Interactive BIC | 25 | 65891 | 66434 | 141.227 | 66012 | 0.6737 |
The Hand Picked with Transform model appears to be the best choice for balancing the desired ends and will be declared Our Winner. Its adjusted R-squared is very close to the fully additive model's, and it has a very low VIF, a competitive RMSE, and a low number of parameters. Additionally, the Cities Removed model is very good for how few predictors it has, and the Interactive BIC model would be a good choice if there were no concern about how many predictors were used.
Our selected model is the one in which we hand picked variables and performed transformations on some of the predictors. We will now take a closer look at that model and its performance.
coef(summary(hand_picked_add_model_transform))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 465300.88 7360.211 63.218 0.000e+00
## Median_Income 37352.32 296.083 126.155 0.000e+00
## Households 16.12 1.394 11.568 7.853e-31
## nearest_citySan Diego -45102.61 1932.162 -23.343 1.343e-118
## nearest_citySan Fransisco -19074.91 1459.619 -13.068 7.777e-39
## nearest_citySan Jose -2927.43 1524.099 -1.921 5.478e-02
## near_a_city_100TRUE -25644.62 2081.853 -12.318 1.027e-34
## log(Distance_to_coast) -30799.79 615.125 -50.071 0.000e+00
## log(dist_to_nearest_city) -20790.74 691.469 -30.067 2.045e-193
This model consists of the following \(\beta\) values:
- Intercept: $465,300.88. This is the baseline value for the LA region; all further modifications are made from that point.
- Median_Income: $37,352.32 per $10,000. For every dollar increase in median income within a block, there is a corresponding increase in median house value of about 3.7 times that amount.
- Households: $16.12 per household. For every additional household within a block, the median house value increases by about $16.12.
- nearest_citySan Diego: -$45,102.61. The median price of a house on a block is $45,102.61 lower when in the vicinity of San Diego instead of Los Angeles.
- nearest_citySan Fransisco: -$19,074.91. The median price of a house on a block is $19,074.91 lower when in the vicinity of San Francisco instead of Los Angeles.
- nearest_citySan Jose: -$2,927.43. The median price of a house on a block is $2,927.43 lower when in the vicinity of San Jose instead of Los Angeles.
- near_a_city_100TRUE: -$25,644.62. The median price of a house on a block is $25,644.62 lower when within 100 km of a city.
- log(Distance_to_coast): -$30,799.79. The median price of a house on a block is $30,799.79 lower for every one-unit increase in the natural log of Distance_to_coast (measured in meters). Put more simply, every factor-of-e (about 2.72x) increase in distance from the coast lowers the predicted value by about $30.8k, so a tenfold increase in distance corresponds to a decrease of roughly $71k.
- log(dist_to_nearest_city): -$20,790.74. The median price of a house is $20,790.74 lower for every one-unit increase in the natural log of dist_to_nearest_city (measured in km); a tenfold increase in distance corresponds to a decrease of roughly $48k.
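Since R's log() is the natural log, the distance effects can be checked with a quick sketch of the arithmetic (the coefficient value is copied from the table above):

```r
# The log(Distance_to_coast) coefficient applies per unit of natural log,
# so a tenfold increase in distance changes log() by log(10) ~ 2.303
beta_coast = -30799.79          # from the coefficient table above
log(10) * beta_coast            # change for a 10x increase: about -70.9k
log(2)  * beta_coast            # change for a doubling of distance
```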
Here is a plot of the actual housing prices vs. the predicted prices from our model.
pred = predict(hand_picked_add_model_transform, newdata = housing_data_tst)
act = housing_data_tst[,"Median_House_Value"]
# Graph
opacity = .5
point_size = .3
plot(act, pred, xlab = "Actual Prices", ylab = "Predicted Prices", col=alpha("dodgerblue", opacity), pch = 16, cex = point_size, main = "Actual vs. Predicted Home Prices")
abline(a = 0, b = 1, col = "darkorange", lwd = 3)
Next we want to assess the normality of the error distribution. We will first do so visually using a Q-Q plot.
plot_qq(hand_picked_add_model_transform)
We can see from the above Q-Q plot that the distribution of errors is not normal.
We will attempt to confirm this judgement with the Shapiro-Wilk test. Since shapiro.test accepts at most 5000 observations, we run it on a random sample of the residuals.
selectees = sample(length(resid(hand_picked_add_model_transform)), size = 5000)
shapiro.test(resid(hand_picked_add_model_transform)[selectees])
##
## Shapiro-Wilk normality test
##
## data: resid(hand_picked_add_model_transform)[selectees]
## W = 0.94, p-value <2e-16
Normality is suspect: we reject the null hypothesis of the Shapiro-Wilk test, which assumes normality.
plot_fitted_resid(hand_picked_add_model_transform)
We can see above that the variance does not appear to be constant. For comparison, we will also look at the fully interactive model's fitted-vs-residuals graph, to see whether maximizing interactions would have fixed this problem.
plot_fitted_resid(hand_picked_int_model)
Interestingly enough, the fully interactive model has a very similar plot, indicating that we have probably done what we could to mitigate this problem.
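The equal-variance judgement can also be tested formally. Since lmtest was loaded earlier, a Breusch-Pagan test is one option; this is a sketch we are adding here, not a test reported in the original analysis:

```r
# Breusch-Pagan test from lmtest: the null hypothesis is homoscedasticity
# (constant error variance); a small p-value suggests non-constant variance
bptest(hand_picked_add_model_transform)
```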
mnames = c("Hand Picked")
models_to_compare = vector(mode="list", length=1)
models_to_compare[[1]] = hand_picked_add_model_transform
comparison_table = compare_models(models_to_compare, mnames, housing_data_trn, housing_data_tst)
kable(comparison_table)
| Number of Parameters | Train RMSE | Test RMSE | VIF (mean) | LOOCV | Adj. R-Squared | |
|---|---|---|---|---|---|---|
| Hand Picked | 9 | 69214 | 69610 | 1.567 | 69256 | 0.6403 |
With this model we have 9 \(\beta\) parameters and can make all assessments using only 6 variables: Median_Income, Households, nearest_city, near_a_city_100, Distance_to_coast, and dist_to_nearest_city. We have an RMSE of about $69,000, which means we can expect an appropriate ballpark estimate, but not a zeroed-in exact guess, of Median_House_Value. The model has a mean variance inflation factor of 1.567, a leave-one-out cross-validated RMSE of 69,256, and an adjusted \(R^2\) of 0.6403.
When choosing between multiple models, we had the option of more predictors to make a better-fitting model, or fewer predictors to create a more easily understood and less computationally intense model. Our goal was to shrink the model to the smallest extent reasonable while still maintaining comparable effectiveness. We were able to accomplish this through the following steps:
1) The transformation of the 4 distance-to-city variables into 4 variables that were more intuitive to use.
2) An initial assessment of the variables and trends.
3) An initial additive model.
4) An attempt to use AIC and BIC searches to reduce model size.
5) Hand pruning of the model using information gathered in step 2.
6) An attempt to increase the size of the model to see if performance increases.
7) Final comparison and selection.
There are certainly more models that we could have created and tested; indeed, there were many models we built and left out of this document to keep the narrative linear and to avoid including excessive unproductive tinkering.
The group felt this model was a nice balance, using only 6 variables and 9 parameters, resulting in a comparably successful model. We also felt this model is simple enough that it can be easily and intuitively understood and applied.
So, with our chosen model, a person could take a few metrics from a city block (like median income, number of households, and distance to the coast) and predict the median house value, to help determine whether a house for sale is over- or under-valued.
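As a sketch of that workflow, a block's metrics could be fed to predict(); the values below are invented purely for illustration:

```r
# Hypothetical new block (all values invented for illustration only)
new_block = data.frame(
  Median_Income = 5.0,           # $50k median income [$10k units]
  Households = 500,
  nearest_city = factor("LA", levels = levels(housing_data$nearest_city)),
  near_a_city_100 = factor("TRUE", levels = c("FALSE", "TRUE")),
  Distance_to_coast = 15000,     # 15 km from the coast [m]
  dist_to_nearest_city = 30      # 30 km from LA [km]
)
# Point estimate plus a prediction interval for the block's median value
predict(hand_picked_add_model_transform, newdata = new_block,
        interval = "prediction")
```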
We noticed a very distinct line in the residuals as well as in the actual vs. predicted home prices plot. Below we can see that a large number of the home prices are ~$500,000, with nothing greater than that in the data, suggesting the response variable was capped at that value.
sum(housing_data$Median_House_Value > 499900 & housing_data$Median_House_Value < 500100)
## [1] 992
We tried removing these data points to see whether doing so had any effect on our models and results, but everything turned out roughly the same.